Conversation


cyb70289 (Contributor) commented May 29, 2025

This PR improves the q4_k_q8_k GEMM kernel with the arm64 i8mm instruction.

Tested on neoverse-n2 with a llama3 8b q4_k_m quantized model.

  • 34% ~ 50% S_PP uplift for all batch sizes
  • 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.



```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
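
For context, here is a minimal sketch (not the PR's actual kernel; the function name `mmla_2x2` is invented for illustration) of the SMMLA primitive that the arm64 i8mm extension provides via the `vmmlaq_s32` intrinsic: one call accumulates a 2x2 tile of int32 dot products from two 2x8 int8 operand matrices, which is why the kernel wants two rows per call.

```
// Sketch only: the i8mm building block behind this PR's GEMM path.
// vmmlaq_s32 (SMMLA) computes C(2x2) += A(2x8) * B(2x8)^T in int8/int32.
// Build with e.g. -march=armv8.6-a+i8mm so __ARM_FEATURE_MATMUL_INT8 is set.
#include <arm_neon.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)
static inline void mmla_2x2(int32_t c[4], const int8_t a[16], const int8_t b[16]) {
    int32x4_t acc = vld1q_s32(c);   // accumulators [c00, c01, c10, c11]
    int8x16_t va  = vld1q_s8(a);    // rows a0, a1 (8 int8 each)
    int8x16_t vb  = vld1q_s8(b);    // rows b0, b1 (8 int8 each)
    acc = vmmlaq_s32(acc, va, vb);  // c[i][j] += dot(a_i, b_j)
    vst1q_s32(c, acc);
}
#endif
```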
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 29, 2025
ggerganov merged commit 54a2c7a into ggml-org:master May 29, 2025
46 checks passed
cyb70289 deleted the q4k branch May 29, 2025 12:04
uint32_t utmp[4];

#if defined(__ARM_FEATURE_MATMUL_INT8)
if (nrc == 2) {

@cyb70289: Naive question - if I understand correctly, this is the number of rows, and if it has to be 2 to use SMMLA, how come we see gains with batch size 1 in prompt prefilling?

cyb70289 (Contributor, Author)


Prompt prefill is different from token generation. In PP, all the tokens are processed at once, so the activation shape is [batch_size, prompt_tokens, embedding_size]. So I8MM is always useful for PP even if batch=1 (unless the prompt has only one token). For TG, the activation shape is [batch_size, 1, embedding_size], so I8MM only works for batch > 1.
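
A hypothetical illustration of this point (not llama.cpp code; the helper and numbers are made up): in prefill the activation matrix has one row per prompt token, so rows can be paired for the 2-row i8mm path even at batch size 1, while in token generation each sequence contributes only a single row.

```
// Toy example: how many 2-row i8mm calls are possible per weight tile.
#include <stdio.h>

static int paired_i8mm_calls(int activation_rows) { return activation_rows / 2; }

int main(void) {
    // prefill: rows = number of prompt tokens, even with a single sequence
    printf("prefill, 128-token prompt, batch 1: %d paired calls\n", paired_i8mm_calls(128));
    // generation: rows = number of sequences in the batch (1 token each)
    printf("generation, batch 1: %d paired calls\n", paired_i8mm_calls(1));
    printf("generation, batch 4: %d paired calls\n", paired_i8mm_calls(4));
    return 0;
}
```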


Thank you very much @cyb70289 for taking the time to respond. That makes sense.

May I ask - what is nrc in the context of this micro-kernel? Is it the row count of the tile that this micro-kernel is processing? So, if I understand correctly, the I8MM path is triggered for cases where the row count in the tile is == 2?

cyb70289 (Contributor, Author)


IIUC, this nrc is a constant, either 1 or 2, as set in the updated type_traits_cpu[] in this patch. It indicates the maximum number of rows this kernel can handle in one shot. It's not related to the tensor shape, but it can be reduced to 1 when the tensor is just a vector, even if the kernel can handle 2.

cyb70289 (Contributor, Author)


The framework will feed the kernel the appropriate number of rows (nrc) based on its reported capability and the actual data shape.
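
A rough sketch of that idea, under stated assumptions (the typedef and function names below are invented for illustration and are not ggml's real dispatch interface): the caller hands the kernel up to its reported row capability per call, dropping to a single row when the activation is just a vector or one row is left over.

```
// Sketch only: feeding a vec_dot kernel nrc rows at a time based on its
// reported capability (nrc_max = 2 when i8mm is available, 1 otherwise).
#include <stddef.h>

typedef void (*vec_dot_fn)(int n, float *out, const void *x,
                           const void *y, size_t y_row_bytes, int nrc);

static void mul_mat_rows(int n, int nrows, float *out, const void *x,
                         const void *y, size_t y_row_bytes,
                         vec_dot_fn kernel, int nrc_max) {
    int ir = 0;
    while (ir < nrows) {
        // pair rows when possible, otherwise fall back to the 1-row path
        int nrc = (nrows - ir >= nrc_max) ? nrc_max : 1;
        kernel(n, out + ir, x,
               (const char *)y + (size_t)ir * y_row_bytes, y_row_bytes, nrc);
        ir += nrc;
    }
}
```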


Got it - thanks. So basically SMMLA is in action only when nrc is literally 2. Thanks.


Nor7th commented Aug 13, 2025

@cyb70289 Hi, I'm testing this patch on an N2 machine with a deepseek q4_k model. It seems that sometimes it goes into your optimized branch and sometimes it falls back to the SVE branch - is this normal?

cyb70289 (Contributor, Author)

> @cyb70289 Hi, I'm testing this patch on an N2 machine with a deepseek q4_k model. It seems that sometimes it goes into your optimized branch and sometimes it falls back to the SVE branch - is this normal?

What's the batch size? For single batch, only the prompt prefill stage may enter the optimized path.
